Abstract: In this paper we focus on the challenging problem
of place categorization and semantic mapping on a robot without
environment-specific training. Motivated by the ongoing success
of convolutional networks in various visual recognition tasks, we
build our system upon a state-of-the-art convolutional network. We overcome
its closed-set limitations by complementing the network with a 
series of one-vs-all classifiers that can learn to recognize new 
semantic classes online. Prior domain knowledge is incorporated
by embedding the classification system into a Bayesian
filter framework that also ensures temporal coherence. We 
evaluate the classification accuracy of the system on a robot 
that maps a variety of places on our campus in real-time. 
We show how semantic information can boost robotic object 
detection performance and how the semantic map can be used 
to modulate the robot’s behaviour during navigation tasks. The 
system is made available to the community as a ROS module. 

I. Introduction

To become truly ubiquitous, mobile service robots that 
operate in human-centered complex indoor and outdoor 
environments need to develop an understanding of their 
surroundings that goes beyond the ability to avoid obstacles, 
autonomously navigate, or build maps. Truly useful robots 
need to be able to extract semantic information about the 
place they operate in [1]. Instead of merely answering the 
question of “Where am I?” [2] that often links to the 
problems of localization or SLAM, robots should also know 
“What is this place like?” to aid higher level decision processes,
to infer high-level information about an environment,
to ease robot-human interaction and to modulate the robot’s 
behaviour. The problem of assigning a semantic label to 
places or parts of the environment is often referred to as place 
categorization [3], [4], or - when combined with creating a 
map (see Fig. 1) - semantic mapping [5], [6]. To address 
this challenge we focus on transferable and expandable 
semantic place categorization and mapping for robotics. 
Transferable means the place categorization does not require 
environment-specific training. To achieve this, we leverage 
the recent success of convolutional networks (ConvNets) in 
the computer vision community where networks were recently
trained specifically for the task of place categorization
[7]. In contrast to state-of-the-art semantic mapping systems 
in robotics [3], [5], [8], [9] these networks generalize well 
and do not have to be re-trained or fine-tuned for specific 
environments. However, ConvNets can only recognize the 
classes they have been trained on. This closed-set constraint 
is ubiquitous in computer vision but poses an important 
Fig. 1: A semantic map generated by the system described
in this paper. The colors encode the semantic categories of 
different places encountered in the environment. The figure 
shows a map of an office environment (orange) with a 
kitchenette (dark green) and a long corridor (light green). 


limitation for robotics applications that aim at long-term 
autonomous operations and life-long learning. We overcome 
the closed-set limitation and present a novel expandable 
classification system by complementing the ConvNet with 
one-vs-all classifiers that are computationally cheap to train 
and can learn to recognize new classes online. In typical 
benchmark applications in computer vision each image is 
treated individually. In this paper we exploit the fact that 
robots see a temporally coherent stream of image data and 
embed the semantic classifier in a Bayesian filter framework. 
This allows us to correct spurious misclassifications and 
incorporate prior knowledge. We demonstrate that more 
coherent results are achieved when interpreting semantic 
classification as a probabilistic estimation problem with first 
order Markov properties. In addition to applying a state-of-the-art
ConvNet-based classifier on a robot, and showing that
no environment-specific training is required for robotic place 
categorization, our paper provides the following contributions:

1) We overcome the closed-set limitation of the ConvNet
classifier by training computationally cheap additional
one-vs-all classifiers that can learn to recognize new
classes online.

2) To benefit from the temporal coherence between consecutive
camera images, we integrate this dual classification
system in a Bayesian filter framework.

3) We combine the place categorization with a robotic 
mapping system and test it in real-time on a large dataset 
in a variety of indoor and outdoor places. 

4) We demonstrate that semantic place information can 
boost object detection and recognition on a robot. 

5) We demonstrate how a semantic map can be used to 
modulate robotic behaviour in navigation tasks. 

6) We provide the complete system to the community as 
a ROS module. 

To the best of our knowledge such a comprehensive robotic 
semantic mapping system has not been proposed before. 
In the following we discuss related work in the field in 
Section II before introducing our system in Section III. 
Section IV presents the experimental setup and the dataset 
used for evaluation. Finally we draw conclusions and discuss 
future work in Section VII. 

II. Related Work 

The topic of vision-based semantic scene categorization 
has been explored both in the robotics and computer vision 
community. The SUN (Scene UNderstanding) challenge initiated
by Xiao and Torralba et al. has driven this field of
research forward and resulted in a number of benchmark
papers, e.g. [4], [10]. Very recently, [7] achieved a significant
improvement in semantic place categorization on the SUN
benchmark by training a convolutional network for this task.

A. Semantic Mapping in Robotics 

Wu et al. [11] defined the visual place categorization problem
for robotics using a purely appearance-based method
based on CENTRIST features [9]. Their system has been 
trained on image sequences collected in six different apartment
houses and was able to distinguish between typical
semantic room categories like living room or bathroom. A 
similar system has been proposed by [8] and was tested on 
the same dataset. In contrast to [11] they use SIFT features 
extracted in a dense grid. Although both papers use a leave-one-out
approach for training and testing (i.e. training on data
collected in 5 of the 6 houses, and testing on the 6th), the visual
similarity between the apartments and therefore the similarity
between training and testing data is high. In contrast
to the aforementioned purely vision-based methods, Zender 
et al. [12] combine laser-based place categorization with 
vision-based object detection and a “commonsense ontology” 
to build a so-called conceptual map of the environment
that contains semantic information. In [3], Pronobis et al. 
examined another place classification system that combined 
sensor data from a camera and a 2D laser range finder to 
assign semantic labels like corridor, office or meeting room 
in an office building. They combine this with a metric SLAM 
system to accumulate class labels over time and generate 


a semantic map of the environment. Although the authors 
separate their training and testing environments by using 
different floors of the same building, the visual similarity
between the training and testing instances is very high. In 
later work [5], the same authors extended their system to 
include more sensor modalities like object information, or 
human input in addition to the visual scene appearance and 
the geometry information obtained by the laser. Again they 
separate training and testing environment by using different 
floors in the same building. Unfortunately, their dataset is
not publicly available, so no benchmarking against their
results was possible. Apart from appearance-based methods,
other authors explored semantic mapping that relies on the 
detection of objects to infer semantic information about the 
current place (e.g. a cereal box is more likely to be spotted 
in a kitchen, while a stapler indicates being in an office). An 
early example is [13]. Another research topic that is closely 
related to semantic mapping is the discovery of places as 
recently demonstrated by Murphy and Sibley [14] as well as 
Paul et al. [15], [16]. In contrast to the semantic mapping 
methods described above, this research focussed on finding 
meaningful clusters in the sensor data acquired by a robot 
over longer periods of time and did not assign semantic labels 
to them. Such summaries of perception or experience over
time can ease the generation of annotated training data.

B. Features for Semantic Classification 

A commonality between all of the aforementioned vision-based
systems is that they rely on a fixed set of hand-crafted
features that are extracted from the images and then used 
for classification. For example [8] uses dense SIFT, [11] 
uses CENTRIST [9], [5] relies on SURF and CRFH and 
so on. However, a recent trend in computer vision, and 
especially in the field of object recognition and detection, is 
to exploit learned features using deep convolutional networks 
(ConvNets). The most prominent example of this trend is the 
annual ImageNet Large Scale Visual Recognition Challenge 
where for the past two years many of the participants have 
used ConvNet features [17]. The concept of convolutional 
networks is not new and was proposed by LeCun et al. [18] to 
recognize hand-written digits. Their popularity has risen
steadily with algorithmic improvements such as dropout and rectified
linear units [19], [20] and with the widespread availability of
GPUs to train these models. Several research groups have
shown that ConvNets outperform more classical approaches
for object classification or detection that are based on hand-crafted
features [21-25]. Recently [7] used ConvNets to
beat all competing approaches on the task of semantic place 
categorization. 

C. The Closed-Set Limitation 

An important limitation of ConvNets and many other 
classifiers is the implicit closed-set assumption.
The classifier is trained on a fixed set of classes and is 
never presented a new or unknown class during test time. 
While this constraint is widespread in many computer vision 



and machine learning benchmarks [26], it is not a realistic
assumption to make in long-term robotics applications. Overcoming
this limitation is currently an active field of research
in machine learning and computer vision [26-28], but the field has
not yet converged to a widely accepted solution.

D. Summary 

Despite the large body of work on semantic mapping and 
scene classification in the robotics community, most research 
systems lack a clear separation between training and testing 
data which makes their transferability and generalizability
questionable. Also, to the best of our knowledge, the combination
of a convolutional network, a set of computationally
cheap one-vs-all classifiers for online learning of new classes, 
and a Bayes filter that enforces temporal coherence has not 
been explored for vision-based semantic mapping before. 

III. System Overview 

Our proposed place categorization and semantic mapping 
system consists of four main parts: 

1) a convolutional neural network that classifies each image
individually,

2) one-vs-all classifiers that recognize scene classes the 
network was not trained on, 

3) a Bayesian filter to exploit temporal coherence and 
remove spurious false classifications, and 

4) the mapping subsystem that gradually builds a map 
using the resulting place labels. 

We describe each part in the following, before presenting 
experiments and evaluations in the next section. We provide 
the complete system as a ROS module for download on our 
website http://tinyurl.com/semantic-mapping-QUT. 

A. Transferable Place Categorization 

To classify each camera image individually, we leverage
the recent successes of Convolutional Networks for
various visual recognition tasks. We use the Places205
network recently published by Zhou et al. [7] since it is
the state of the art in place categorization and outperforms
all competing methods on various benchmark datasets.¹ The
Places205 network follows the same basic architecture
as AlexNet [21] but was trained specifically for the
task of place categorization. The training dataset comprised 
2.5 million images of 205 semantic categories, with at least 
5,000 images per category. The images originated from 
several internet sources such as Google Images, Bing, and 
Flickr. The images were labeled by human workers using 
Amazon Mechanical Turk. The large number and variety 
of the training dataset ensures that the resulting classifier 

¹In an earlier version of this paper we trained a nonlinear SVM on the
fc7 layer of the AlexNet provided by Caffe [29]. This system achieved
a top-1 accuracy of 46.2% on the SUN-397 benchmark, thus achieving state-of-the-art
performance, only beaten by [30] at that time which achieved
47.2%. While this paper was in preparation, [7] published their specialized
Places205 network which achieves 54.3%, thus outperforming all
previous approaches by a large margin. We therefore decided to switch to
Places205.


generalizes well and does not need to be retrained or fine-tuned
when deployed in environments it has not seen during
training. This ensures that our semantic mapping system is 
transferable and can be deployed on any robot in a variety 
of environments. The input to the Places205 ConvNet is
an RGB image that is resized to 227×227×3 pixels, independent
of its initial aspect ratio. The network’s output layer
prob represents the discrete probability distribution $p(x_t \mid I_t)$
over all 205 known classes $x_i$, given the current image $I_t$.
The network processes a single image in approximately 30
ms on an Nvidia Quadro K4000 GPU, which is more than
sufficient for typical robotics applications.

B. Expandable Place Categorization 

A major difference between the computer vision community
and robotics is the closed-set assumption. Most object
detection or scene classification benchmarks in computer 
vision assume that all classes are known during training, 
and that the classifier is presented only images of one of the 
known classes during testing [7], [17]. This is called closed 
set classification. However, research in robotics aims at 
life-long operations and long-term autonomy over extended 
periods of time. Inevitably, the robot will be faced with 
scene categories² that were not part of the initial training
set, but are important for the robot’s mission. Being able to 
extend the classification framework with new classes during 
deployment therefore is crucial. We show how the place
categorization based on the Places205 network can be
expanded by a set of new classes $y_i$ that are not part of
the original training set: We propose to train a one-vs-all
classifier that distinguishes the new class $y_i$ from the already
known classes $\{x_0, \ldots, x_n\}$. The advantage of this

approach is that it is not necessary to retrain the ConvNet, 
which would be computationally expensive (typical training 
times are in the order of days) and would require a lot 
of training images (in the order of hundreds or thousands) 
of the new class. In contrast, a Random Forest one-vs-all 
classifier can be trained in under a minute using only a 
few (in the order of 10-100) training images. We let the
classifier use the output of the fc7 layer of the Places205
network as a feature vector. The fc7 layer is the last generic
(i.e. class independent) fully connected layer in the network.
The layers fc8 and prob have 205 output neurons since 
they are specifically tailored for the task of recognizing 
the 205 classes from the training dataset. As mentioned
before, $p(x_t \mid I_t)$, the discrete probability distribution over
$n = 205$ class labels $x_i$, is the classification result of
the Places205 network, given the current image $I_t$. Now
$p(y_i \mid I_t)$ denotes the result of one of the one-vs-all classifiers
that is trained to classify the new class $y_i$. Let

$$\hat{x} = (x_0, x_1, \ldots, x_n, y_0, \ldots, y_m) \qquad (1)$$

²The same is true for object recognition tasks, where the 1000 object
classes in ImageNet might not be sufficient or specialized enough for 
robotics tasks. 



denote the combined vector of class labels. Then we define
the combined likelihood $\mathcal{L}(I_t \mid \hat{x}_t)$ as

$$\mathcal{L}(I_t \mid \hat{x}_t) = \big(p(x_0 \mid I_t), \ldots, p(x_n \mid I_t), p(y_0 \mid I_t), \ldots, p(y_m \mid I_t)\big) \qquad (2)$$

Re-normalization distributes the probability between the n
classes known to the ConvNet classifier and the m additional
classes known to the one-vs-all classifiers in a natural way.
Notice that this assumes independence between the class
labels $x_{0 \ldots n}$ and $y_{0 \ldots m}$ as well as pairwise independence
between any $y_i$ and $y_j$.
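
The construction of the combined likelihood in (2), followed by the re-normalization step, can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `combine_likelihoods` and the toy probabilities are assumptions made for the example.

```python
def combine_likelihoods(convnet_probs, one_vs_all_probs):
    """Concatenate ConvNet class probabilities with one-vs-all scores
    and re-normalize so that the combined vector sums to one."""
    combined = list(convnet_probs) + list(one_vs_all_probs)
    total = sum(combined)
    if total <= 0.0:
        raise ValueError("at least one class must have non-zero probability")
    return [p / total for p in combined]


# Toy example (hypothetical values): three ConvNet classes plus one
# additional class learned by a one-vs-all classifier.
convnet_probs = [0.7, 0.2, 0.1]   # p(x_i | I_t), already sums to one
one_vs_all_probs = [0.5]          # p(y_0 | I_t) from an independent classifier
likelihood = combine_likelihoods(convnet_probs, one_vs_all_probs)
```

Because the one-vs-all scores are produced independently of the ConvNet output, the concatenated vector generally does not sum to one before the division by `total`.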

C. Bayesian Filtering over Class Labels for Temporal Coherence

Typical computer vision benchmarks for place categorization
or object detection treat each image individually [7],
[17]. In contrast, most of the observed sensor data in robotics 
have a temporal dimension. Knowing that two images were 
observed consecutively can be a strong source of information 
that can be exploited by using Bayesian filtering techniques. 
We interpret the robotic place categorization problem as a
probabilistic estimation problem and estimate the discrete
distribution $p(\hat{x}_t \mid I_{0:t})$ over all possible place labels,
given all the observed images $I_{0:t}$ from the past until now.
Assuming first order Markov properties, this leads to the
following well-known Bayesian filter step:

$$p(\hat{x}_t \mid I_{0:t}) = \mathcal{L}(I_t \mid \hat{x}_t) \cdot p(\hat{x}_{t-1} \mid I_{0:t-1}) \qquad (3)$$

where $\mathcal{L}(I_t \mid \hat{x}_t)$ is the combined likelihood defined in (2).

D. Incorporating prior knowledge 

Interpreting place categorization as a Bayesian estimation
problem allows us to incorporate other sources of
information in a very natural way. For instance we might
know that many of the 205 categories Places205 can
recognize are unlikely or even impossible to be observed
in the environment the robot is deployed in. This kind of
knowledge $p(\hat{x})$ can be easily incorporated by an additional
prior term:

$$p(\hat{x}_t \mid I_{0:t}) = p(\hat{x}) \cdot \mathcal{L}(I_t \mid \hat{x}_t) \cdot p(\hat{x}_{t-1} \mid I_{0:t-1}) \qquad (4)$$
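
A filter step of this form can be sketched in a few lines. The sketch below multiplies the prior, the current likelihood and the previous posterior element-wise and then normalizes the result. The function name `bayes_filter_step`, the small `floor` that keeps a class from being locked at zero by earlier frames, and the toy likelihoods are assumptions made for illustration and are not taken from the released system.

```python
def bayes_filter_step(previous_posterior, likelihood, prior=None, floor=1e-6):
    """One filter update: posterior is proportional to
    prior * likelihood * previous posterior, then normalized."""
    n = len(likelihood)
    if prior is None:
        prior = [1.0] * n  # uniform prior: no domain knowledge
    unnormalized = [
        prior[i] * likelihood[i] * max(previous_posterior[i], floor)
        for i in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("prior and likelihood have no common support")
    return [u / total for u in unnormalized]


# Hypothetical per-frame likelihoods for three classes; the second frame
# is a spurious result that favours class 1.
likelihoods = [[0.6, 0.3, 0.1], [0.4, 0.5, 0.1], [0.6, 0.3, 0.1]]

posterior = [1.0 / 3.0] * 3
map_labels, ml_labels = [], []
for frame_likelihood in likelihoods:
    posterior = bayes_filter_step(posterior, frame_likelihood)
    map_labels.append(max(range(3), key=posterior.__getitem__))
    ml_labels.append(max(range(3), key=frame_likelihood.__getitem__))

# A prior of zero removes a class from consideration entirely.
restricted = bayes_filter_step([1.0 / 3.0] * 3, [0.6, 0.3, 0.1],
                               prior=[1.0, 1.0, 0.0])
```

In this toy sequence the maximum likelihood label flips to class 1 on the second frame, while the filtered maximum a posteriori label stays at class 0 throughout.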

E. Semantic mapping 

The place categorization component described in the previous
section creates a probability distribution $p(\hat{x}_t \mid I_t)$ over
the known class labels, given the current camera image $I_t$.
This continuous stream of classification results is the input to 
the semantic mapping component, along with a laser range 
scan which aids the map building. To build a combined 
semantic and metric map of the environment, we apply the 
occupancy grid mapping algorithm and maintain one map 
layer per semantic category. Fig. 2 illustrates this concept. 
Instead of expressing the probability of being occupied 
or free, each cell in these semantic layers expresses the 
probability of belonging to a certain semantic category. This 
is achieved by propagating the class probability $p(x_i \mid I_t)$
along the laser rays that are within the field of view of the 
camera and updating the penetrated map cells using the usual 



Fig. 2: An illustration of the semantic map structure. Level 
zero of the map is a regular occupancy grid but the higher
levels encode the probabilities that a grid cell belongs to a 
certain semantic category. Each layer in the map represents 
one semantic category. The cells are updated according to 
equation (6). 

recursive Bayes filter update method [31] for occupancy 
maps. This way, all unoccupied cells that are within 5 meters 
of the robot’s position and within the camera’s field of view 
are updated with the currently observed semantic label. The 
probability that a cell $k$ represents a place category $x_i$
given the classification results $I_{1:t}$ is:

$$p_k(x_i \mid I_{1:t}) = \left[ 1 + \frac{1 - p_k(x_i \mid I_t)}{p_k(x_i \mid I_t)} \cdot \frac{1 - p_k(x_i \mid I_{1:t-1})}{p_k(x_i \mid I_{1:t-1})} \cdot \frac{p(x_i)}{1 - p(x_i)} \right]^{-1} \qquad (5)$$

where $p_k(x_i \mid I_t)$ is the probability that cell $k$ is of place
category $x_i$ given the currently observed image $I_t$, while
$p_k(x_i \mid I_{1:t-1})$ is the previous estimate and $p(x_i)$ is a prior
probability. For better performance, our implementation uses
a log-odds representation, which results in the following
simple update equation [32]:

$$L_k(x_i \mid I_{1:t}) = L_k(x_i \mid I_{1:t-1}) + L_k(x_i \mid I_t) - L(x_i) \qquad (6)$$

where $L(x_i) = \log \frac{p(x_i)}{1 - p(x_i)}$. Finally, a clamping step ensures
that $l_{\min} \le L_k(x_i \mid I_{1:t}) \le l_{\max}$. Notice that spurious false

classifications do not permanently corrupt the map since the 
probabilities within the cells are adapted gracefully and can 
be corrected with later observations. The resulting map can 
be used in a variety of tasks. For instance for path planning 
different traversal costs can be assigned to different labels to 
make the robot avoid busy places during certain times of the 
day (e.g. the food court during lunch time). 
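
The per-cell update can be illustrated with a short sketch in log-odds form. The update rule below is obtained by taking the logarithm of the odds in equation (5), so the prior enters as a subtracted log-odds term that vanishes for a prior of 0.5. The clamping bounds `L_MIN` and `L_MAX` and the observation values are hypothetical, since the text does not state the values used in the released system.

```python
import math

# Hypothetical clamping bounds in log-odds space (assumed for illustration).
L_MIN, L_MAX = -2.0, 3.5


def log_odds(p):
    """Convert a probability to log-odds, guarding against 0 and 1."""
    p = min(max(p, 1e-6), 1.0 - 1e-6)
    return math.log(p / (1.0 - p))


def probability(l):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-l))


def update_cell(cell_log_odds, p_observed, p_prior=0.5):
    """Add the log-odds of the new observation, subtract the log-odds of
    the prior, and clamp the result to [L_MIN, L_MAX]."""
    updated = cell_log_odds + log_odds(p_observed) - log_odds(p_prior)
    return min(max(updated, L_MIN), L_MAX)


# One layer cell for a single category: three confident observations
# followed by one contradicting observation.
cell = 0.0  # log-odds of 0.5
for p in (0.8, 0.8, 0.8):
    cell = update_cell(cell, p)
saturated = cell
cell = update_cell(cell, 0.1)
```

The clamping is what keeps a cell responsive: after the three confident observations the cell sits at `L_MAX`, so a single contradicting observation already moves it back into the interior of the interval.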

IV. Experiments, Evaluation and Results 
A. Place Categorization on a Real Robot 

We demonstrate and evaluate the place categorization 
performance of our proposed system on a real robot on 
our university’s campus. We use the MobileRobots Research 
GuiaBot shown in Figure 3 and test the system on the images 
of three types of cameras that are mounted on the robot: 
1) the RGB camera from the Microsoft Kinect (version 1) 
sensor, 2) a Point Grey Grasshopper monochrome camera and
3) the front facing camera of the Ladybug2, a spherical
camera. The test dataset was collected by teleoperating the
robot across nine diverse environments on our
campus and recording the images from all three cameras as 

























Fig. 3: The Guiabot robot used to evaluate the system and 
example images from all three cameras in a variety of places: 
Kinect RGB (color), Grayscale, and Ladybug (portrait format).
Notice that all images are resized to a fixed size of
231 × 231 before calculating the ConvNet features. While
the change in aspect for the RGB and Grayscale images is 
minor, the Ladybug image gets squeezed significantly. Also 
notice the low quality of the Ladybug image. 


Environment        RGB       Grayscale   Ladybug
corridor           100.0%    98.4%       98.2%
office S11         94.4%     96.4%       97.0%
parking garage     90.8%     86.0%       98.0%
foodcourt          84.2%     65.6%       48.2%
cafe outside       66.6%     62.2%       53.1%
shop               53.3%     56.3%       45.9%
lobby              49.1%     n/a         31.2%
lecture theater    44.5%     38.0%       35.2%
outdoor            33.3%     3.3%        55.9%
weighted average   67.7%     61.6%       59.8%

TABLE I: Accuracies for the different cameras in the QUT 
Gardens Point Campus dataset. The bottom row gives the 
average accuracy weighted by the number of recorded frames 
for each environment. Notice that there were no grayscale 
images recorded for the lobby environment. 


well as laser scans and odometry. The traversed environments
are listed in Table I. Since we know the robot is operating
on the campus, only the following set of semantic classes
could possibly be encountered: X = {corridor, classroom,
office, parking_lot, restaurant, food_court, kitchen, kitchenette,
lobby, supermarket, clothing_store, botanical_garden,
coffee_shop}. We incorporated this prior information as described
in Section III-D. The prior probabilities of all classes
not in X were set to zero. We measure the top-1 classification 
accuracy for each of the nine environments separately for the 
images captured by the three different types of cameras on 
the robot. The classifier results were checked by a human 
expert and the resulting accuracies are summarized in Table 
I. The RGB camera (from the Kinect 1 sensor) produces the 
Fig. 4: Comparing maximum a posteriori and maximum likelihood
solutions on an indoor dataset. Each color corresponds
to a different class label. The plots show that the Bayes
filter (top) smooths out spurious classification results that
would occur when trusting the maximum likelihood solution
(middle). The bottom plot shows the highly fluctuating ML
solution without incorporating prior domain knowledge (i.e.
reducing the prior probability of observing certain scene
classes).


best results, presumably because the camera characteristics
are more similar to the cameras the training set was captured
with. Our system performed well with grayscale images,
with an average accuracy 6.1 percentage points below that of the
RGB camera and only performing much worse on the outdoor dataset.
This indicates that color is not an important cue for scene
classification in many indoor environments. The Ladybug
camera performed worst. We attribute this to the extreme
deformations that occur (see Fig. 3) when squeezing the
Ladybug’s upright format images into a square input image
for Places205.

B. The Effect of the Bayes Filter 

Fig. 4 illustrates the positive effects of the Bayes filter 
that enforces temporal coherence and incorporates prior 
knowledge. The maximum a posteriori solution is much more
stable than the maximum likelihood solution alone. Spurious
results are smoothed out, reducing false classifications. The 
bottom plot shows the maximum likelihood solution when 
not incorporating prior domain knowledge, and using a 
uniform prior probability over all 205 class labels instead. 

C. Semantic Mapping Results 

Fig. 5 shows the output of our semantic mapping system
using the RGB images in the nine different tested environments
on campus. The maps are color coded, and
only the winning labels are rendered. The percentage values
given in the figure correspond to the fraction of correctly
classified images listed in Table I, but do not necessarily
reflect the correctly classified map area. Each place has
a main category label that describes the place in general.
However, all nine places contain mixed categories of
places. For example, the office environment, shown in
the left middle of Fig. 5, contains actual offices (orange) that
are connected by a long corridor (light green) and a kitchen
area (brown). Similarly, not all areas in the supermarket
Fig. 5: Maps created by the semantic mapping system in
nine different parts of our campus (not drawn to scale). 
The percentages refer to the fraction of correctly classified 
images, not to the correctly labeled map area. See the text 
for further explanations. 


environment have been assigned the label supermarket 
(yellow). A large area in the lower right can be seen colored 
in dark blue, representing a clothing store. Manual 
inspection confirmed this classification result, since the store 
actually sells clothes and has them on display in this part of 
the shop. Towards the top and the right of the supermarket
map, large windows lead out to the open campus. There, the
system created incoherent classifications and assigned the
labels botanical garden and parking lot. Another particularly
challenging map is that of the food court. The classifier
correctly assigned the label restaurant (purple) in
areas with tables and chairs and food_court (pink) when
the robot faces the actual food stalls. The transition areas
have been labeled as lobby. Given that our test dataset
contained many very challenging and cluttered scenes, we 
are very content with the classification and mapping results 
produced by our system. 

D. Expanding the Classifier with New Classes 

As discussed in Section III-B, the ability to add new
classes to the classification system is crucial for long-term
operation. We expanded our system by adding a new
class door that Places205 cannot detect. We randomly
selected 80 positive examples from the elevator-door³
category of the SUN-397 dataset [10], and 26,000 negative
images from other categories. Training the classifier on a 
standard desktop machine using Python’s scikit-learn 

³There is no proper door category in the dataset.


implementation takes only 57 seconds.
Although the one-vs-all classifier was trained on elevator
doors, it reliably detects the doors that are typically found
in offices and corridors. We tested this on another
dataset (1332 images) we collected in our office environment,
where the robot was driven through multiple offices and
corridors. The system detected 8 of the 10 doors in the
environment and produced two false positive detections. The
two false negatives can be attributed to rapid camera motion
(the door was only visible for 4 frames) and extremely bad
lighting conditions (the door was at the end of a long unlit
corridor). For one of the two false positive detections the
classifier responded to a wooden plank on the wall.
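
For reference, the detection counts reported above translate into precision and recall as follows. This is plain arithmetic on the numbers in the text (8 detected doors, 2 false positives, 2 missed doors); the helper name `precision_recall` is introduced only for this example.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from raw detection counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Door detection experiment: 8 of 10 doors found, 2 false positives.
precision, recall = precision_recall(true_positives=8,
                                     false_positives=2,
                                     false_negatives=2)
```

Both values come out at 0.8 for this experiment.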

V. Case Studies: Applications of Semantic 
Information in a Robotics Context 

A. Semantic Place Information Boosts Object Detection 

The semantics of a place provide valuable information 
that can boost the performance of object detection and 
recognition on a robot. We demonstrate this by running an 
object detection pipeline inspired by [24] that consists of an 
object proposal step (using EdgeBoxes [33]) and a ConvNet 
classifier (AlexNet [21] as implemented in Caffe [29]).
Similar to Places205, AlexNet provides a discrete probability
distribution over the 1000 object classes it was trained
on. We denote this as $p(c \mid \pi)$, where $\pi$ is the image patch the
classifier is applied on. Depending on the semantic context,
different object classes are more likely to be observed than
others. E.g. in a kitchen we expect to see cups and mugs,
but not a motor bike. We propose to exploit such knowledge
in a naive Bayes classifier:

$$p(c \mid \pi, x^*) \propto p(c \mid \pi) \cdot p(c \mid x^*) \qquad (7)$$

where $x^*$ is the maximum-a-posteriori solution of the semantic
place category for the currently observed scene and
we assumed independence between $\pi$ and $x^*$ and uniform
prior probabilities. The terms $p(c_i \mid x^*)$ express the likelihood
to observe an object of class $c_i$ in a scene with label $x^*$. We
learned these priors from the NYU2 dataset [34] by analysing 
the ground truth labels and building statistics over the relative 
occurrences of object classes for every scene type. Non-occurring
object classes or classes that do not appear in both
the NYU2 and the ImageNet datasets were given a small but
non-zero default prior probability. We tested the combination
of our semantic place categorization and object detection on
a robot in a kitchen environment. We found the robot was
more accurate in its object classifications when it had the
additional semantic information available and could apply
equation (7). Fusing both streams of information increased
the top-1 accuracy of correctly detected objects from 0%
to 54% and the top-5 accuracy from 15% to 100% in this
experiment. Examples of the boosted object detection results
are shown in Fig. 6.
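
The re-weighting in equation (7) can be sketched as an element-wise product followed by normalization. All numbers below are invented for illustration and are not the priors learned from NYU2. The function name `boost_with_scene_prior` and the value of `default_prior` are likewise assumptions.

```python
def boost_with_scene_prior(object_probs, scene_prior, default_prior=1e-3):
    """Multiply classifier probabilities p(c | patch) by scene-conditioned
    priors p(c | scene) and normalize. Classes without a learned prior
    receive a small non-zero default."""
    scores = {
        label: p * scene_prior.get(label, default_prior)
        for label, p in object_probs.items()
    }
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


# Hypothetical classifier output for one image patch.
object_probs = {"motor_scooter": 0.40, "cup": 0.25,
                "coffee_mug": 0.20, "lampshade": 0.15}
# Hypothetical object statistics for scenes labeled "kitchen".
kitchen_prior = {"cup": 0.30, "coffee_mug": 0.25, "lampshade": 0.02}

boosted = boost_with_scene_prior(object_probs, kitchen_prior)
top_before = max(object_probs, key=object_probs.get)
top_after = max(boosted, key=boosted.get)
```

In this toy case the top-1 label changes from `motor_scooter` to `cup` once the kitchen context is taken into account, which mirrors the qualitative effect described above.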

B. Path Planning on a Semantic Map 

We demonstrate how the semantic map created by our 
system can modulate the behaviour of a robot in an indoor 





AlexNet results 


Boosted results 



Fig. 6: Case study I: Semantic place information significantly
boosts object recognition performance. The examples illustrate
how the top-5 classification results by AlexNet on
objects in a kitchen scene can be improved by incorporating
prior knowledge that is conditioned on the current semantic
scene class. Left: original results, right: results when exploiting
the semantic mapping system and boosting the object
classifier with prior probability $p(c_i \mid x^*)$. See text for details.


navigation scenario. In our example a robot has to navigate 
from its current position to another place in a workplace 
environment with offices and corridors. During working 
hours the robot should try to avoid disturbing humans in 
their office spaces and rather take longer detours through 
the corridors. During night times, the shortest path is always 
preferable. Such scenarios can be easily implemented when 
performing the path planning in the semantic map. Different 
class labels in the map can be assigned different cost values 
that are used by a path planner. Fig. 7 compares the results 
of an A* path planner when avoiding offices and when 
preferring the shortest path. 
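
The idea of label-dependent traversal costs can be sketched with a small grid-based A* planner. The grid layout, the label names and the cost values are hypothetical and chosen only to reproduce the day and night behaviour described above. They are not taken from the released system.

```python
import heapq


def a_star(labels, start, goal, label_cost):
    """A* on a 4-connected grid of semantic labels. Entering a cell costs
    label_cost[label]; cells labeled "wall" are not traversable. The
    Manhattan heuristic is admissible as long as every cost is >= 1."""
    rows, cols = len(labels), len(labels[0])

    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_set = [(heuristic(start), 0.0, start)]
    best_cost = {start: 0.0}
    parent = {start: None}
    while open_set:
        _, cost, cell = heapq.heappop(open_set)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            label = labels[nr][nc]
            if label == "wall":
                continue
            new_cost = cost + label_cost[label]
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(open_set,
                               (new_cost + heuristic(nxt), new_cost, nxt))
    return None


C, O, W = "corridor", "office", "wall"
semantic_grid = [
    [C, C, C, C, C],
    [C, W, W, W, C],
    [C, O, O, O, C],
]
start, goal = (2, 0), (2, 4)

night_path = a_star(semantic_grid, start, goal, {C: 1.0, O: 1.0})
day_path = a_star(semantic_grid, start, goal, {C: 1.0, O: 10.0})
```

With equal costs the planner cuts straight through the offices. Raising the office cost during working hours makes the longer corridor detour cheaper, so the same planner avoids the offices without any change to the map itself.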

VI. The Provided ROS Module 

We provide the semantic place categorization system and 
the semantic mapper as a ROS module to the community. 
The module consists of two nodes. One node subscribes to
an image topic and interfaces with the Caffe framework [29]
that implements the Places205 ConvNet. The network
provides a probability distribution over the 205 known scene
types. On an Nvidia Quadro K4000 GPU, this operation takes
0.031 seconds for a 320 × 240 image. During the classification
process, features from layer fc7 are extracted and
passed through the one-vs-all classifiers to detect additional
classes. This step requires an additional 3 ms per classifier using
a Python implementation of a random forest classifier on
a desktop machine. After passing the combined likelihoods
through the Bayes filter, the posterior probability distribution 
is published as a ROS topic. A second node subscribes 
to this topic, the data of a laser range finder, the robot 
pose estimate and a grid map created by a SLAM system 



Fig. 7: Case study II: Path planning on a semantic map. 
Semantic information can be used to modulate the robot’s 
behaviour by translating place labels to different cost values 
for the path planner. This way the robot avoids office spaces 
during working hours (top) but follows the shortest path 
during the night (bottom). 


such as gmapping. Our node fuses all this information and
creates a 3-dimensional map structure (based on OctoMap)
where each layer contains a probability map for a specific
semantic class. The complete system can be downloaded from
http://tinyurl.com/semantic-mapping-QUT, where additional
information and documentation can be found. This ROS 
package provides a readily usable semantic categorization 
and mapping system to the community that can be deployed 
without environment-specific training. 

VII. Conclusion and Future work 

Our paper introduced a novel transferable and expandable
place categorization and semantic mapping system that
requires no environment-specific training, can be expanded
with new classes, and is embedded in a Bayesian filter framework
that allows the incorporation of prior information and
enforces temporal coherence of the classification results. Semantic
information about the environment is an important enabler
of more advanced robotic tasks, especially for human-robot
collaboration. Humans describe places, goals, and
objects using semantic categories and it is natural for them to
formulate tasks using these categories. We demonstrated how
semantic information can influence robotic navigation tasks
in a workplace, making robot operations more compliant
with human needs. A second case study demonstrated how 
semantic information supports robotic object detection and 


























increases the performance of this equally important visual 
recognition problem. In future work we will extend the 
system to support multi-scale or sub-scene categorization, i.e. 
assigning labels to parts of the scene. We will also explore 
how semantic mapping can guide visual place recognition by 
partitioning the search space to similar semantic places and 
apply the system to various robotic tasks in the real world. 